Video captioning with stacked attention and semantic hard pull
Authors
Abstract
Video captioning, i.e., the task of generating captions from video sequences, creates a bridge between the Natural Language Processing and Computer Vision domains of computer science. Generating a semantically accurate description of a video is quite complex. Considering the complexity of the problem, the results obtained in recent research works are praiseworthy. However, there is plenty of scope for further investigation. This paper addresses this scope and proposes a novel solution. Most video captioning models comprise two sequential/recurrent layers: one as a video-to-context encoder and the other as a context-to-caption decoder. This paper proposes a novel architecture, namely Semantically Sensible Video Captioning (SSVC), which modifies the context generation mechanism by using two novel approaches: "stacked attention" and "spatial hard pull". As there are no exclusive metrics for evaluating video captioning models, we emphasize both quantitative and qualitative analysis of our model. Hence, we have used the BLEU scoring metric for quantitative analysis and have proposed a human evaluation metric for qualitative analysis, the Semantic Sensibility (SS) scoring metric. The SS Score overcomes the shortcomings of common automated scoring metrics. This paper reports that the use of the aforementioned novelties improves the performance of state-of-the-art architectures.
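The paper's exact formulation is not reproduced on this page; purely as an illustrative sketch, "stacked attention" can be understood as repeatedly re-attending over per-frame features, with each pass's context vector becoming the next pass's query. The function names and the dot-product scoring below are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, features):
    """One dot-product attention pass over per-frame features (T, d)."""
    scores = features @ query          # (T,) similarity of each frame to the query
    weights = softmax(scores)          # (T,) attention distribution over frames
    return weights @ features          # (d,) weighted context vector

def stacked_attention(query, features, num_layers=2):
    """Stack attention passes: each context becomes the next query."""
    ctx = query
    for _ in range(num_layers):
        ctx = attend(ctx, features)
    return ctx

# toy usage: 10 frames of 64-d features, mean feature as the initial query
rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 64))
context = stacked_attention(frames.mean(axis=0), frames)
```

Because each pass outputs a convex combination of the frame features, stacking passes lets the model progressively sharpen which frames dominate the context fed to the caption decoder.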
Similar Articles
Video Captioning with Multi-Faceted Attention
Recently, video captioning has been attracting an increasing amount of interest, due to its potential for improving accessibility and information retrieval. While existing methods rely on different kinds of visual features and model structures, they do not fully exploit relevant semantic information. We present an extensible approach to jointly leverage several sorts of visual features and sema...
Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning
Recent progress has been made in using attention based encoder-decoder framework for video captioning. However, most existing decoders apply the attention mechanism to every generated word including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the", "a"). However, these non-visual words can be easily predicted using natural language model without considering visual...
Image Captioning with Attention
In the past few years, neural networks have fueled dramatic advances in image classification. Emboldened, researchers are looking for more challenging applications for computer vision and artificial intelligence systems. They seek not only to assign numerical labels to input data, but to describe the world in human terms. Image and video captioning is among the most popular applications in this t...
Spatio-Temporal Attention Models for Grounded Video Captioning
Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description that relies on temporal localization in order to ground the visual concepts. However, most existing automatic video captioning systems map from raw video da...
Multi-Task Video Captioning with Video and Entailment Generation
Video captioning, the task of describing the content of a video, has seen some promising improvements in recent years with sequence-to-sequence models, but accurately learning the temporal and logical dynamics involved in the task still remains a challenge, especially given the lack of sufficient annotated data. We improve video captioning by sharing knowledge with two related directed-generati...
Journal
Journal title: PeerJ
Year: 2021
ISSN: 2167-8359
DOI: https://doi.org/10.7717/peerj-cs.664